Evaluating supervised topic models in the presence of OCR errors
Abstract
Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document collections and relationships between those topics and document metadata, such as timestamps. We examine empirically the effect of OCR noise on the ability of supervised topic models to produce high quality output through a series of experiments in which we evaluate three supervised topic models and a naive baseline on synthetic OCR data having various levels of degradation and on real OCR data from two different decades. The evaluation includes experiments with and without feature selection. Our results suggest that supervised topic models are no better, or at least not much better in terms of their robustness to OCR errors, than unsupervised topic models and that feature selection has the mixed result of improving topic quality while harming metadata prediction quality. For users of topic modeling methods on OCR data, supervised topic models do not yet solve the problem of finding better topics than the original unsupervised topic models.
Similar resources
Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effect that character-level noise can have on unsup...
Minimally Supervised Methods to Correct Optical Character Recognition
Optical character recognition (OCR) is the transformation of an image of handwritten or typed text to raw text. It is used in a range of modern applications that can make a significant impact, so it is important to have robust OCR programs. There are a variety of such programs today, both proprietary and open-source. We have implemented a post-processing layer over OCR output from GOCR to reduce ...
Evaluating the Performance of Rehabilitated Roadway Base with Geogrid Reinforcement in the Presence of Soil-Geogrid-Interaction
One efficient technique for improving the behavior of a paved road under traffic loads is to place geosynthetic material in the sub-base or in the soil under the road. In past years, much research has been done on this topic, but the study on the effect of soil/load conditions on the performance of the paved road rehabilitated with geogrid in order to investigate the effecti...
Correction of OCR Word Segmentation Errors in Articles from the ACL Collection through Neural Machine Translation Methods
Depending on the quality of the original document, Optical Character Recognition (OCR) can produce a range of errors – from erroneous letters to additional and spurious blank spaces. We applied a sequence-to-sequence machine translation system to correct word-segmentation OCR errors in scientific texts from the ACL collection with an estimated precision and recall above 0.95 on test data. We pr...
Wised Semi-Supervised Cluster Ensemble Selection: A New Framework for Selecting and Combing Multiple Partitions Based on Prior knowledge
The Wisdom of Crowds, an innovative theory described in social science, claims that the aggregate decisions made by a group will often be better than those of its individual members if the four fundamental criteria of this theory are satisfied. This theory has been used in clustering problems. Previous research showed that this theory can significantly increase the stability and performance of...
Publication date: 2013